
Data in Brief

Elsevier BV

Preprints posted in the last 90 days, ranked by how well they match Data in Brief's content profile, based on 13 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
High-quality proteins and RNAs extracted from exact same samples for proteomics and RNA-Seq analyses

Fatou, M.; Kornobis, E.; Douche, T.; Druart, K.; Puchot, N.; Matondo, M.; Monot, M.; Bourgouin, C.

2026-01-19 molecular biology 10.64898/2026.01.16.699903 medRxiv
Top 0.1%
3.3%

In the 1990s, the single-step method developed by Chomczynski and Sacchi for RNA isolation was extended to the sequential isolation of RNA, DNA and proteins from the same sample. Although the quality of the extracted RNA proved compatible with RNA-Seq analyses, extracting the protein fraction from the same sample was time-consuming and resulted in a low yield and quality of proteins, incompatible with LC-MS proteomic analyses. Here we report a novel procedure that isolates the protein fraction and the RNA fraction in parallel from the exact same minute mosquito samples. We provide evidence that the cognate fractions are compatible with LC-MS proteomic analysis on the one hand and RNA-Seq analysis on the other. This protocol is simple, time-efficient and suitable for studies involving limited sample size, and could easily be applied to a broad range of animal and human samples.

2
High-throughput targeted paleoproteomics sex estimation on medieval Great Moravia individuals using MALDI-CASI-FTICR mass spectrometry

Bray, F.; Pilmann Koterova, A.; Garbe, L.; Haegelin, M.; Bertrand, B.; Agossa, K.; Rolando, C.; Veleminsky, P.; Bruzek, J.; Morvan, M.

2026-02-18 evolutionary biology 10.64898/2026.02.17.706309 medRxiv
Top 0.1%
2.7%

Estimating the biological sex of archaeological remains is crucial in bioarchaeology and forensic anthropology. In recent years, proteomics based on molecular sexual dimorphism has emerged as a preferred method, particularly because of its minimally invasive approach to extracting amelogenin X and Y proteins from tooth enamel. However, there is an increasing demand to accelerate this process while facilitating the analysis of large archaeological assemblages. This study presents a novel high-throughput targeted paleoproteomics method for biological sex estimation using MALDI-CASI-FTICR mass spectrometry. This approach combines the strengths of existing methods, including ultra-high resolution, significantly reduced processing times, targeted analysis, and scalability to large archaeological sample sets. The method was initially validated on modern individuals of known sex and subsequently applied to 130 adult and juvenile individuals from medieval Great Moravia (present-day Czech Republic). Biological sex was successfully estimated for all but one of the individuals. The results not only provide a more efficient biological sex estimation but also help to resolve a few errors in sex assessment previously encountered with osteomorphological and tooth morphometric techniques. The implementation of this method significantly improves the accuracy and efficiency of biological sex estimation, offering a powerful tool for anthropological research.

3
Tracing mobility among Eneolithic-Bronze Age Kurgan populations in the North Pontic steppe

Nikitin, A. G.; Renson, V.; Ivanova, S.; Neff, N. C.; Straioto, H.; Svyryd, S.

2026-03-24 evolutionary biology 10.64898/2026.03.21.713323 medRxiv
Top 0.1%
2.6%

Five millennia ago, nomadic people from the North Pontic steppe left a profound impact on the course of Eurasian prehistory. However, little is known about their mobility patterns within their home region. To address this knowledge gap, we conducted a survey of the strontium isotope landscape of people interred in the 4th-3rd millennium BCE burial mounds (kurgans) of the western part of the North Pontic steppe. By analyzing the strontium signature in human bone and dentin, we established strontium baseline values for the region. We subsequently correlated enamel strontium ratios from 25 selected individuals with the baseline obtained and with published strontium data across the North Pontic steppe. Enamel strontium ratios show that some individuals interred in the northwest North Pontic fall within the regional baseline range, whereas others overlap with values reported for the eastern North Pontic steppe. In conjunction with carbon (δ13C) and nitrogen (δ15N) stable isotope data, we further determined that some individuals interred in the western Pontic steppe either spent the later part of life in the west Caspian steppe or were affected by physiological stress during their lifetime. By integrating our data with published isotopic datasets, we produced a first baseline heatmap of the North Pontic steppe for the c. 4000-2000 BCE chronological period.

4
Development and fit for purpose validation of a quantitative LC-MS/MS method for heparan sulfate in cerebrospinal fluid as a biomarker for mucopolysaccharidosis type IIIA

Bystrom, C.; Douglass, K.; Gupta, M.

2026-03-30 genetic and genomic medicine 10.64898/2026.03.27.26348847 medRxiv
Top 0.1%
2.1%

Background: Mucopolysaccharidosis type IIIA (MPS IIIA; Sanfilippo syndrome) is a fatal neurodegenerative lysosomal storage disorder caused by impaired degradation of heparan sulfate (HS). Despite rapid advances in gene and enzyme therapies, there remains a critical need for an analytically validated, quantitative biomarker that accurately reflects central nervous system (CNS) substrate burden. Such a biomarker would be a valuable tool in assessing disease progression and monitoring therapeutic efficacy. Objective: This study describes the method development, fit-for-purpose validation, and preliminary clinical application of a quantitative liquid chromatography-tandem mass spectrometry (LC-MS/MS) assay for the HS-derived disaccharide N-sulfoglucosamine-glucuronic acid (GlcNS-GlcUA) in human cerebrospinal fluid (CSF), a critical biomarker for diagnosis, disease monitoring, and regulatory evaluation of emerging MPS IIIA therapies. Methods: A structurally defined GlcNS-GlcUA reference standard and its [13C6]-labeled internal standard were used in a derivatization and detection workflow employing 1-phenyl-3-methyl-5-pyrazolone labeling and LC-MS/MS. Results: The method exhibited acceptable linearity across 0.005-0.500 nmol/mL (r ≥ 0.9976), with intra- and inter-assay imprecision ≤ 3.5% CV and accuracy within 95%-110% of nominal concentrations. No matrix or hemolysis interference or carryover was observed, and the analyte remained stable during freeze-thaw storage conditions. Application of the method to 12 CSF samples from patients with MPS IIIA demonstrated quantifiable GlcNS-GlcUA levels ranging from 0.0054 to 0.106 nmol/mL, confirming suitability for clinical and regulatory use. Comparison of the MPS IIIA sample results between the development laboratory and the contract research organization laboratory supports robust inter-lab assay transfer. Conclusions: This validated LC-MS/MS method establishes a regulatory-grade quantitative assay for measurement of CSF HS in MPS IIIA. Its high analytical sensitivity and reproducibility enable reliable assessment of CNS substrate reduction and pharmacodynamic response, supporting biomarker-driven therapeutic development and accelerated approval pathways for neuronopathic mucopolysaccharidoses.

5
Systematic reviews in minutes to hours using artificial intelligence

Bakker, L.; Caganek, T.; Rooprai, A.; Hume, S.

2026-02-10 health informatics 10.64898/2026.02.06.26345764 medRxiv
Top 0.1%
1.9%

Systematic reviews are used in academia, biotechnology, pharmaceutical companies and government to synthesise and appraise large numbers of publications. The current (largely manual) workflow takes an average of 9-18 months [1], at a cost of $100,000+ per review [2]. We built a platform, ScholaraAI, that leverages artificial intelligence to cut this to < 0.1% of the time, without compromising quality. ScholaraAI facilitates end-to-end systematic reviews: search, screening, data extraction, and analysis. The workflow is transparent, and the researcher is in the loop. Our approach is compliant with the PRISMA and RAISE frameworks. Compared to a benchmarking set of published systematic reviews, ScholaraAI's sensitivity for correctly included studies is 100% ± 0%, its specificity for correctly excluded studies is 90.8% ± 8.6%, and its accuracy for data extraction is 98.0% ± 3.5%. The time taken per review was 3.67 ± 1.26 hours. We used ScholaraAI to produce a novel, up-to-date systematic review and meta-analysis, which is presented here. ScholaraAI is free to try at app.scholara.ai.

6
Estimation of Heavy Metal Contamination in Selected Marine Fish in Bangladesh and Their Health Impact

Rahaman, M. A.; Jahan, I.; Alam, S. S.; Shill, L. C.; Dihan, M. A. M.; Al Mamun, M. A.

2026-02-04 public and global health 10.64898/2026.02.02.26345413 medRxiv
Top 0.1%
1.8%

Background: Sea fish traditionally serves as a protein source and plays an indispensable role in providing nutrition for the people of Bangladesh. However, frequent consumption may pose health risks through contamination with toxic heavy metals. The main purpose of this study is to evaluate the concentrations of heavy metals (Cr, Fe, Ni, Mn, Cu, and Pb) in selected sea fish from the Chattogram and Cox's Bazar districts of Bangladesh. Methods: A wet digestion technique was employed to prepare the samples for heavy metal analysis. Atomic Absorption Spectrophotometry (AAS) in flame and furnace modes was used to estimate heavy metal content. Human health risk was evaluated based on Estimated Daily Intake (EDI), Target Hazard Quotient (THQ), Total Target Hazard Quotient (TTHQ) or Hazard Index (HI), and Target Cancer Risk (CR). Results: The descending order of average concentrations for the selected heavy metals was as follows: Fe (32.36) > Ni (12.12) > Pb (9.70) > Cu (7.29) > Mn (5.94) > Cr (5.22). The correlation (r0.587) between Cr and Mn was significantly positive, which indicated that the parameters were interconnected and likely have a common origin within the study area. EDI values of four samples in the case of Cr and six values for Pb exceeded the reference doses (RfD); these included Bombay Duck, Ilish, Silver Pomfret, Longfin Tuna, Indian Threadfin, and Bigeye Ilisha. In six sea fish samples, the THQ for Cr and Pb crossed the allowable limit of 1. The TTHQ/HI values for seven fish species were higher than 1, ranging from Bigeye Ilisha (3.25) to Indian Mackerel (1.35). The CR values for the majority of the heavy metals fell within an acceptable range. Conclusions: From a public health perspective, this study revealed that continuous consumption of heavy metals results in non-oncogenic as well as oncogenic health implications.

7
Archaeological preservation of amelogenesis pathways

Asmundsdottir, R. D.; Troche, G.; Olsen, J. V.; Martinez de Pinillos, M.; Martinon-Torres, M.; Schrader, S.; Welker, F.

2026-03-26 evolutionary biology 10.64898/2026.03.25.713862 medRxiv
Top 0.1%
1.7%

Dental enamel, the hardest mineralised tissue in the human body, has proven to be an excellent source of ancient proteins, which have been found to survive within dental enamel for at least twenty million years. In archaeological and palaeontological contexts, the enamel proteome is generally considered to be rather small, consisting of about twelve proteins, most of which are unique to enamel. During amelogenesis these proteins undergo in vivo digestion by matrix metalloproteinase 20 (MMP20) and kallikrein 4 (KLK4) as well as serine phosphorylation by family with sequence similarity member 20-C (FAM20C), which alter their characteristics. Gaining knowledge of the previously understudied influence of amelogenesis on the archaeological human dental enamel proteome could benefit various palaeoproteomic analyses, especially in a human evolutionary context. Here we present archaeological dental enamel proteomes and explore protein cleavage patterns and sequence coverage to estimate the effects of in vivo digestion, as well as explore phosphorylation patterns. Additionally, we present a new marker based on phosphorylation to estimate genetic sex.

8
Historical Perspectives in Medicine using a Large Language Model: Emulating an 18th Century Physician

Malladi, P.; Eaton, J.; Gleichgerrcht, E.; Chatzistamou, I.; Roark, K.; Kennedy, S. W.; Bonilha, L.

2026-02-12 medical education 10.64898/2026.02.10.26345990 medRxiv
Top 0.1%
1.7%

Introduction: Eighteenth-century medical texts document a formative period in the evolution of clinical reasoning, yet their integration into modern medical education is limited. The traditional approach to learning the history of medicine has naturally focused on passive reading, but new approaches using AI could enable learners to interrogate and simulate historical diagnostic logic and therapeutic paradigms. More specifically, large language models (LLMs) offer an opportunity to create interactive simulations that allow experiential engagement with historical medical reasoning. Methods: We developed a historically constrained LLM-based educational platform designed to emulate the diagnostic reasoning, language, and conceptual frameworks of an 18th-century physician. A modern GPT architecture was customized using strict instruction-based constraints and limited exclusively to a curated corpus of six foundational 17th- and 18th-century medical texts. Guardrails were implemented to prevent anachronistic terminology and modern medical concepts. Model outputs were evaluated qualitatively by comparing the model's diagnoses and treatment plans with published diagnoses and treatments from original 18th-century sources. We also applied the simulation to modern clinical vignettes for an illustrative contrast between modern and 18th-century approaches. Results: The model generated responses that closely aligned with 18th-century medical and rhetorical style, as well as therapeutic reasoning. When presented with historical cases, the simulation demonstrated strong concordance with original diagnoses and management strategies. When applied to modern cases, the model described period-appropriate reasoning, highlighting clear contrasts with contemporary biomedical reasoning. Conclusions: AI broadly, and more specifically LLMs configured as historically constrained simulators, can function as effective tools for learning in medical history. This approach could enable active engagement with historical clinical reasoning, fostering critical reflection on the contingent and evolving nature of medical knowledge. Such temporal simulations hold promise for medical humanities education and interdisciplinary teaching.

9
Performance of Road-Traffic-Based Exposure Proxies Against Personal PM2.5 Measurements in Three Sub-Saharan African Countries

Nyoni, H. B.; Mushore, T. D.; Munthali, L.; Makhanya, S. A.; Chikoko, L.; Luchters, S.; Chersich, M. F.; Machingura, F.; Makacha, L.; Barratt, B.; Mistry, H. D.; Volvert, M.-L.; von Dadelszen, P.; Roca, A.; D'alessandro, U.; Temmerman, M.; Sevene, E.; Govindasamy, T. R.; Makanga, P. T.; The PRECISE Network; The HE²AT Centre

2026-03-17 public and global health 10.64898/2026.03.13.26348337 medRxiv
Top 0.1%
1.7%

Introduction: Particulate matter (PM2.5) exposure contributes to the global disease burden, yet its monitoring remains sparse and uneven, particularly in settings with limited ground monitoring networks. Road-traffic proxy indicators can provide indirect estimates of PM2.5 where measurements are limited but require context-specific validation. We evaluated three road-traffic-related PM2.5 proxies: (i) population-weighted road network density (WRND), (ii) Euclidean (straight-line) distance from highways (EH), and (iii) Euclidean distance from main roads (EM). Methods: We validated proxies using high-resolution outdoor filtered PM2.5 personal exposure measurements collected over 1 year from 343 postpartum participants in The Gambia, Kenya, and Mozambique. Village-level spatial patterns for the PM2.5-proxy relationship were mapped using 5 km hexagonal aggregated tessellations. Proxy-PM2.5 associations were assessed using Spearman correlation, and predictive utility was tested using country-specific and global Random Forest (RF) models (3-fold cross-validation), reporting R², RMSE, and feature importance. Results: Spatial mapping showed heterogeneous proxy-PM2.5 relationships across and within sites, with elevated PM2.5 occurring in both low- and high-proxy contexts. WRND-PM2.5 correlations were weak overall and statistically significant only in Mozambique (r = 0.351; p = 0.005), with non-significant associations in Kenya (r = -0.041; p = 0.673) and The Gambia (r = -0.020; p = 0.909). EH-PM2.5 correlations were positive in The Gambia (r = 0.335; p = 0.053) and Mozambique (r = 0.292; p = 0.020) but negative and significant in Kenya (r = -0.224; p = 0.018). Single-variable RF models performed poorly across all countries (R² < 0.45) and in the global model (R² = 0.42). Combining proxies improved performance in Kenya (R² = 0.52; RMSE = 31.7 µg/m³), in Mozambique (R² = 0.60; RMSE = 8.9 µg/m³), and in the global model (R² = 0.46; RMSE = 29.1 µg/m³), although in The Gambia the combined model (R² = 0.53; RMSE = 37.6 µg/m³) did not exceed the best single-proxy model. Conclusion: Road-network proxies provide context-dependent signals of personal PM2.5 exposure, and predictive performance is strengthened when proxies are combined in a hybrid model.

10
A general methodology for liver sinusoid fenestration analysis based on 3D electron microscopy data

Pohar, C.; Rekik, Y.; Phan, M. S.; Gallet, B.; Desroches-Castane, A.; Chevallet, M.; Tinevez, J.-Y.; Tillet, E.; Vigano, N.; Jouneau, P.-H.; Deniaud, A.

2026-03-09 cell biology 10.64898/2026.03.07.710307 medRxiv
Top 0.1%
1.7%

The liver has a complex architecture composed of millions of lobules. Within these lobules, hepatocytes, the main hepatic cells, are organized in rows separated by blood capillaries known as sinusoids. These capillaries are lined by liver sinusoidal endothelial cells (LSEC) that form a very specific fenestrated endothelium essential for the exchange of metabolites and proteins between the blood and hepatocytes. Alterations in the size and number of LSEC fenestrations are associated with the onset and progression of various liver diseases. The analysis of liver architecture is thus of utmost importance for advancing our knowledge of liver ultrastructure and its alterations. Liver architecture has been studied for decades, mainly using 2D electron microscopy, and more recently using advanced super-resolution fluorescence microscopy. In recent years, volume electron microscopy techniques, including focused ion beam-scanning electron microscopy (FIB-SEM), have progressed and now enable the 3D reconstruction of biological ultrastructures down to nanometer resolution. However, the analysis of large volumes (e.g., several tens of µm³) remains challenging due to various constraints in the segmentation of large datasets. In the current study, we developed a workflow to semi-automatically segment hepatic sinusoids from FIB-SEM mouse liver datasets using the convolutional neural network (CNN)-based tool known as "nnU-Net", after fine-tuning a ground truth model. We also implemented tools for semi-automatic quantification of LSEC fenestrae diameters and sinusoid porosity from segmented datasets. This workflow enabled us to compare the distribution of LSEC fenestrae diameters in wild-type mice versus mice deleted for Bmp9, a hepatic factor known to be involved in fenestration maintenance. Our results confirm the importance of BMP9 for LSEC differentiation. The developed methodology therefore represents a valuable tool for characterizing the fenestrated endothelium under various physiological and pathological conditions.

11
Benchmarking LLM-based Information Extraction Tools for Medical Documents

Yu, A.; Weile, J.; Courtot, M.

2026-01-22 health informatics 10.64898/2026.01.19.26344287 medRxiv
Top 0.1%
1.7%

Motivation: Medical documents are a crucial resource for medical research around the world. While troves of valuable health data exist, they are largely computationally inaccessible as hard copies of unstructured text. Moreover, the persistent prevalence of fax machines in medical settings contributes to further degradation of document quality. Digitization of these resources through manual data extraction is time-consuming and resource-intensive. However, large language models (LLMs) have recently shown great promise for automated digitization and information extraction (IE), greatly improving upon previous tools in terms of speed and accuracy. Results: We reviewed recent LLM-based tools for named entity recognition (NER) and IE from the literature and assessed them with respect to their suitability for use in a clinical setting. We found only two of these tools to be usable out of the box and compared them to LLM foundation models prompted to perform extractions. Using 1000 mock medical documents with paired reference data, we evaluated the tools' performance in different scenarios, comparing zero-shot and one-shot prompts as well as unimodal and multimodal (image and text) inputs where possible. The most effective model was OpenAI's GPT 4.1-mini with an average F1 score of 55.6. The best performing local model was Google's Gemma3 with 27B parameters, given image inputs and a zero-shot prompt, with an average F1 score of 41.3. We found the choice of prompting strategy to have minimal impact on extraction performance. We also assessed the effects of image distortions commonly introduced by fax machines and found a significant impact on extraction performance. Availability: Source code and data are available on GitHub at https://github.com/courtotlab/PDF_benchmarking.

12
Mapping the North American Terrestrial Carbon Cycle: A Process-based Reanalysis Using State Data Assimilation (SDA)

Zhang, D.; Huggins, J.; Li, Q.; Ramachandran, S.; Serbin, S.; Webb, C.; Zuo, Z.; Dietze, M. C.

2026-02-26 ecology 10.64898/2026.02.25.708030 medRxiv
Top 0.1%
1.5%

The ability to accurately assess ecosystem C budgets across scales from individual sites to continents is essential for C accounting, management, and ultimately mitigating climate change. State data assimilation (SDA) provides a framework for harmonizing observations with models, while robustly accounting for and reducing multiple sources of uncertainty. In this study, we employed a hybrid SDA framework that combines process-based terrestrial biosphere modeling, hierarchical Bayesian inference, and machine learning to harmonize bottom-up and remotely sensed data streams for 8,000 pre-selected 1 km² locations across North America. Combining bottom-up soils data (SoilGrids) with spectral (MODIS and Landsat) and microwave (SMAP) remote sensing helps constrain the major C and water stocks through space and time. Machine learning is used both to identify and correct systematic errors in the process model (SIPNET) and to interpolate the pre-selected locations onto a 1 km grid, making it computationally feasible to generate annual ensemble maps of the North American carbon budget. Furthermore, the uncertainties for each variable were reduced compared to those from observations or models alone. Spatiotemporal analysis showed a slight decrease in aboveground biomass (AGB) across the western US, a loss of leaf area across the boreal, and a slight greening of the Alaskan tundra. The uncertainty trends suggest a significant reduction in the uncertainty about soil organic carbon (SOC), the largest C reservoir. Validation results show that we accurately estimate C pools, compared to the assimilated data streams and held-out observations of AGB from GEDI, ICESat-2, and the US FIA, and SOC from the ISCN network. Our ML-debiasing algorithm further improved the accuracy of major C pools (AGB, SOC). In general, our continental SDA framework will facilitate global C MRV (monitoring, reporting, and verification) by providing accurate and precise C-cycle estimates, along with their corresponding spatiotemporal uncertainties.

13
Correlative scanning electron and super-resolution structured illumination microscopy

Hamilton, J. R.; Levis, S.; Hagen, G. M.

2026-02-11 biophysics 10.64898/2026.02.09.704937 medRxiv
Top 0.1%
1.5%

Correlative microscopy techniques are used for many different applications in the biological sciences because the comparison of different imaging methods allows researchers to gain more insight and data from samples. Correlative light and electron microscopy (CLEM) methods have been developed to preserve biological samples to withstand the harsh environments necessary for electron microscopy. After first being imaged using widefield (WF) and super-resolution structured illumination fluorescence microscopy (SIM), a NanoSuit chemical treatment was applied to a mammalian testis sample before imaging with scanning electron microscopy (SEM). This was done to compare the image quality and resolution of each technique. SEM yields higher resolution and offers validation of results from SIM.

14
Beyond dairy: Identification of dental enamel proteins in ancient human dental calculus

Leite, A.; Welker, F.; Godinho, R. M.; Gillis, R. E.; Islas, V. V.; Fagernas, Z.

2026-03-24 evolutionary biology 10.64898/2026.03.21.713223 medRxiv
Top 0.1%
1.5%

Ancient human dental calculus is one of the richest archives of archaeological biomolecular information, providing direct evidence of diet, oral health, and the oral microbiome. Proteomic analyses of this biological matrix have so far focused mainly on oral microbes and dietary proteins, with milk proteins such as beta-lactoglobulin (BLG) providing the largest corpus of proteomic evidence. Despite the close relation between the various stages of dental calculus formation and mineralization with the dental enamel surface, proteins from the dental enamel matrix have not previously been reported outside of dental enamel tissue. Here we reanalysed 498 ancient dental calculus proteomes from 14 published studies (n=434 individuals) reporting the presence of BLG, spanning from the Neolithic to the Victorian Era and applying different protein extraction protocols (FASP, GASP, SP3 and in-solution digestion). Dental enamel matrix proteins were identified in ten studies (n=37 individuals), with amelogenin being the most frequently detected. Enamel peptides occurred more often in studies that applied SP3, although amelogenin was successfully identified through both SP3 and FASP. Structural proteins, including enamelin, ameloblastin, and MMP20, were also identified. The detection of AMELX and AMELY peptide sequences provided new insights into cases where the sex was previously undetermined. These findings establish dental enamel proteins as a new category of biomolecules detected in dental calculus, broadening its application beyond diet and microbiome studies to possible sex estimation. 
Highlights:
- Dental calculus entraps oral microbes along with endogenous and exogenous particles during formation and mineralization
- We conduct reanalysis of 14 published ancient dental calculus studies (n = 434 individuals) spanning the Neolithic to Victorian Era
- Dental enamel proteins AMELX, AMELY, AMBN, COL17A1, ENAM and MMP20 are identified in ancient human dental calculus
- Amelogenin was the most frequently detected enamel protein
- We expand dental calculus palaeoproteomics beyond diet and oral microbiome to potentially include sex estimation

15
Transcriptomic data of larval zebrafish exposed to continuous sub- and supra-MCL sodium arsenite and uranyl nitrate.

Kalaniopio, P. H.; Allen, R. S.; Salanga, M.

2026-02-23 pharmacology and toxicology 10.64898/2026.02.22.707205 medRxiv
Top 0.1%
1.4%

Uranium (U) and arsenic (As) are both ubiquitous contaminants in the American southwest, posing risks to humans, animals, and the environment. The chronic effects and mechanisms of toxicity of depleted uranium (DU) are incompletely understood. Differential gene expression analysis of concomitant exposures to identify markers of toxicity has not been undertaken until now. Continuous low-dose, high-dose, and concomitant exposures are investigated using the larval zebrafish (Danio rerio), with exposure paradigms lasting from embryo collection until sampling at 5 days post fertilization (dpf). Herein, we describe overall differential gene expression with counts and pathway enrichment statistics using both gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses. The raw dataset has been deposited in NCBI's Gene Expression Omnibus (GEO) repository [1] under the accession number GSE319292 [2].

Value of the data:
- Uranyl nitrate (UN), a water-soluble depleted uranium species, and sodium arsenite (As) are both ubiquitous contaminants in the American southwest, posing risks to humans, animals, and the environment. The United States Environmental Protection Agency (EPA) has set maximum contaminant limits (MCL) of 30 ppb U atoms and 10 ppb As atoms, respectively.
- These data show differentially expressed genes (DEGs) from larval zebrafish exposed to 1 or 10 µM As, 30 or 300 µg/L UN, or 1 µM As and 30 µg/L UN in combination. Concentrations were specifically chosen based on environmental relevance.
- Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of up- and down-regulated DEGs are provided to understand the molecular mechanisms of uranium toxicity and inform future studies.
- These data should be used for biomarker identification and mechanistic interrogation of single and combinatorial exposures of environmentally relevant compounds at realistic exposure levels.

16
Biodesign Buddy: Integrating Generative Artificial Intelligence in Academic Biodesign

Riffle, D.; Rubery, P.

2026-03-13 scientific communication and education 10.64898/2026.03.11.710906 medRxiv
Top 0.1%
1.4%

Biodesign is an interdisciplinary research domain that incorporates principles from design and the life sciences to develop new systems, processes, and objects. Collegiate biodesign educators face unique pedagogical challenges, including an absence of relevant scholarship on curriculum design and instructional best practices for cultivating student scientific literacy. These difficulties may be overcome with newly available technologies, like generative AI systems, that enable personalized learning through domain-specific semantic spaces. This article examines the instructional value of one such domain-specific LLM, Biodesign Buddy, through a mixed-methods analysis of an eight-week study involving 64 students participating in an international biodesign competition. Results indicate strong support for integrating AI into biodesign coursework. Surveys captured attitudes toward AI, scientific literature, and learning experiences to assess AI's impact on learning outcomes. Findings suggest that integrating AI into biodesign pedagogy can meaningfully redress conceptual issues in biodesign while informing broader debates on AI's role in higher education. Impact Statement: This article introduces Biodesign Buddy, a domain-specific generative AI system for collegiate biodesign education, and reports on its exploratory deployment, offering design principles and preliminary findings to inform the development of AI-supported pedagogies for interdisciplinary biodesign instruction.

17
TaxonMatch: taxonomic integration and tree construction from heterogeneous biological databases

Leone, M.; Rech De Laval, V.; Drage, H. B.; Waterhouse, R. M.; Robinson-Rechavi, M.

2026-03-20 evolutionary biology 10.64898/2026.03.18.712418 medRxiv
Top 0.1%
1.3%

Integrating taxonomic data from various sources presents a significant challenge in biodiversity research, due to non-standardized nomenclature and evolving species classifications. Discrepancies between major repositories like the Global Biodiversity Information Facility (GBIF) and the National Center for Biotechnology Information (NCBI), as well as citizen science platforms such as iNaturalist, lead to fragmented and sometimes inaccurate biological data. We present TaxonMatch, a tool designed to address these challenges. TaxonMatch aligns taxonomic names, resolves synonymy, and corrects typographical and structural inconsistencies across databases. We show how it can be used to build a common backbone arthropod taxonomy over NCBI, GBIF and iNaturalist, to find the closest molecular data to a given fossil, and to identify IUCN endangered species with molecular data. TaxonMatch provides a cohesive taxonomic framework and a consistent taxonomic backbone, and can be applied to any taxonomic source. The tool is available at https://github.com/MoultDB/TaxonMatch.

18
DEDuCT 3.0: An enhanced and expanded FAIR-compliant resource and toxicology knowledge graph for endocrine disrupting chemicals

Chivukula, N.; Vashishth, S.; Kandasamy, P.; Madgaonkar, S. R.; Samal, A.

2026-01-26 pharmacology and toxicology 10.64898/2026.01.23.701267 medRxiv
Top 0.1%
1.3%

Endocrine disrupting chemicals (EDCs) are of particular regulatory and research interest due to the increasing incidence of endocrine-related disorders, such as declining fertility rates and reproductive health problems. The Database of Endocrine Disrupting Chemicals and their Toxicity Profiles (DEDuCT) has gained importance in both academic and regulatory settings by systematically curating data from published literature to characterize these chemicals. Given the growing body of EDC literature, this study aimed to consolidate the latest research and update this critical database. First, more than 14,000 research articles were screened through an extensive four-stage manual process, and integrated with the earlier version to create the updated DEDuCTv3.0, comprising 1043 unique EDCs and 796 unique endocrine-related endpoints curated from 3269 published articles. Thereafter, human- and rodent-specific biological endpoint data including interacting genes/proteins, phenotypes, diseases, and adverse outcome pathways (AOPs) were curated from toxicology-relevant databases and systematically integrated with DEDuCTv3.0 to construct a large-scale toxicology knowledge graph for EDCs, termed DEDuCT-KG. DEDuCT-KG was then hosted on a Neo4j database and made easily accessible through a novel interactive user interface. The utility of DEDuCT-KG was demonstrated by exploring potential mechanisms of action associated with obesogenic EDCs within DEDuCTv3.0. Furthermore, the constructed EDC-AOP network, linking 949 EDCs to 381 AOPs within AOP-Wiki, revealed diverse toxicity mechanisms associated with EDCs. Integration with a consumer product database and regulatory chemical lists showed that some of these EDCs are present in food contact materials, personal care products, and daily use items, highlighting potential exposure pathways. Overall, all data compiled in this study have been integrated into the DEDuCT webserver, which has been further enhanced to align with FAIR principles.
In sum, this study provides a much-needed update to DEDuCT and offers a single point of access to EDC-relevant data to accelerate research and regulation of EDCs.

19
A high-performance end-to-end 3D CLEM processing workflow for facilities

Roberge, H.; Woller, T.; Pavie, B.; Hennies, J.; de Heus, C.; Edakkandiyil, L.; Liv, N.; Munck, S.

2026-03-16 cell biology 10.64898/2026.03.13.711046 medRxiv
Top 0.1%
1.3%

Correlative Light and Electron Microscopy (CLEM) integrates the molecular specificity of light microscopy (LM) with the ultrastructural detail of electron microscopy (EM), enabling comprehensive spatial analysis of biological samples. Despite growing demand, processing 3D CLEM datasets remains challenging, particularly for service provision in facilities, due to their multimodal nature and the lack of unified approaches. Typical steps include EM slice alignment, LM-EM registration, segmentation, and 3D visualization. We present a modular, end-to-end pipeline that consolidates existing and newly developed tools into a coherent workflow for 3D CLEM analysis and allows railroading the approach. Designed as interoperable modules accessible through a user-friendly interface, the pipeline is fully open-source and scales from standard workstations to high-performance computing environments to address the need for analysis of growing datasets. While some steps still require manual input, individual components can be automated to increase throughput and reproducibility. Together, this integrated solution lowers technical barriers and supports broader adoption of 3D CLEM methodologies.

20
The Global Imbalance in Telemedicine Research: An Analysis of Knowledge Production and Socioeconomic Drivers

Aarabi, S. S.; Semnani, F.; Sedaghat, M.

2026-03-02 public and global health 10.64898/2026.02.27.26347284 medRxiv
Top 0.1%
1.3%

Background: This study aims to explore disparities in telemedicine research, investigate the impact of the COVID-19 pandemic on these inequalities, and examine the association between various socioeconomic factors and telemedicine research output across Low- and Middle-Income Countries (LMIC) and High-Income Countries (HIC), and World Health Organization (WHO) regions. Methods: A comprehensive search strategy was developed to identify telemedicine-related documents (2018-2022) in Scopus and SciVal, with false positives and negatives resolved. Mann-Whitney U and Wilcoxon Signed Rank tests compared publication volume and Field-Weighted Citation Impact (FWCI). A novel metric, Research Interest (RI), was calculated by dividing telemedicine publications by total outputs in medicine and life sciences. WHO regions were ranked using TOPSIS. Spearman Rank Correlation assessed links between socioeconomic variables and research output separately in HIC and LMIC. Analyses were conducted using R (v4.3.2). Results: We retrieved 16,584 telemedicine-related articles: 4,244 from 58 LMIC and 13,622 from 47 HIC, including 1,282 collaborative publications (30% of LMIC and 9.4% of HIC outputs). HIC consistently produced more publications than LMIC. While FWCI differences were significant in the pre-COVID era (Cliff's Delta = 0.48), no significant difference was observed post-COVID. RI for telemedicine showed no significant difference between HIC and LMIC in any timeframe. The Western Pacific led in quality metrics, while the Americas ranked highest overall. Southeast Asia ranked lowest in both. Exclusively among HIC, Health Expenditure (Purchasing Power Parity adjusted) (r = 0.63, r = 0.45) and Human Development Index (r = 0.50, r = 0.47) were moderately, and ICT service exports (USD) (r = 0.72, r = 0.33) were strongly correlated with both telemedicine scientific output and RI. Conclusion: Global inequalities in telemedicine research favor HIC, though the gap narrowed post-COVID.
Among HIC, telemedicine research patterns more proportionately reflect socioeconomic indicators, research capacity, infrastructure, and domestic health needs.